To reiterate: it looks like the code provided with Qi's 2006 paper does not include anything for the classification step. I guess they didn't see the point, since the classification can be performed with plenty of different toolsets and they probably used one they couldn't just share. In any case, I want to replicate their results, so I need to get the feature vectors they had, run some Python classification algorithms over them and see if I can get similar results to the paper.
The code is basically just a couple of Perl scripts, and I should be able to run these without much hassle; hopefully that will give me the feature vectors. Unfortunately, if anything does go wrong it's going to take a while to work out, because I don't know Perl.
In [1]:
ls
In [5]:
%%bash
head -n 30 Batch_feature_ExtractWrapper.pl
Ok, so the documentation isn't very extensive. The important part is the usage line perl command inputPairlist, where I guess command is the name of the wrapper script itself?
Anyway, which file is the inputPairlist? I think it's in the 0yeast_gene_list/ directory.
In [6]:
cd 0yeast_gene_list/
In [7]:
ls
In [9]:
%%bash
head YeastGeneListOrfGeneName-106_pval_v9.0.txt
Guess this maps one reference name for a gene to another?
In [11]:
%%bash
head -n 30 make_fullpair_4protein.pl
Slightly confused about this. I'm guessing it's meant to produce the input pair list for the main wrapper, but I don't see where to get pos_pairFile, outPosPairFile or outRandPairFile. Trying to find Science03-pos_MIPS_complexes.txt, mipsPosPair.txt and mipsRandpair.txt:
In [ ]:
cd ..
In [31]:
%%bash
find . -name "Science*"
find . -name "mips*"
No sign of them.
I guess I'm supposed to get them from somewhere else? That doesn't make much sense, though: why tarball up all the code but not make it runnable?
Ok, maybe I'm just being stupid and two of those are output files? In that case we'd only need a positive list of proteins. Is that in here somewhere? I don't see it.
Looks like I can get them online from this page on the Qi site.
Downloaded them and put them in a directory called featuresets, so I can have a look at them now:
In [1]:
ls -lh featuresets/phyInteract/
In [16]:
%%bash
head -n 1 featuresets/phyInteract/dipsPosPair.feature
In [11]:
%%bash
head -n 1 featuresets/phyInteract/dipsRandpairSub23w.feature
Opening up these files:
In [1]:
import csv,glob
nipfiles = glob.glob("featuresets/phyInteract/dips*.feature")
print nipfiles
#note: these files are big, so we may need generator functions to read each file lazily
One option would be to write a generator function that opens these up and returns rows only as they are required, to save RAM. If I get round to writing this properly I'll put it here:
In [ ]:
#writing generator class
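#a rough sketch of the idea, assuming (as the pandas loading below seems to
#confirm) that the files are plain comma-separated text:
def feature_rows(fname):
    #yield one row at a time so the whole file never has to sit in memory
    with open(fname) as f:
        for row in csv.reader(f):
            yield row
#e.g. sum(1 for row in feature_rows(nipfiles[0])) would count rows without loading the file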
Another option would be to use pandas as it's pretty much designed to solve this problem.
In [2]:
import pandas as pd
#test importing one of these csvs
pd.read_csv(nipfiles[0],header=None).head()
Out[2]:
In [3]:
#initialise dictionary
nipd = {}
#load into dictionary
for fname in nipfiles:
    nipd[fname] = pd.read_csv(fname,header=None)
Now just need to see if we can extract the arrays we need to train the logistic regression model. Looks like I should use the iloc indexing method.
In [5]:
#concatenate the positive and negative training sets together
#(assuming nipfiles[0] is the positive set and nipfiles[1] the negative set, as printed above)
X = pd.concat((nipd[nipfiles[0]].iloc[:10,0:162],nipd[nipfiles[1]].iloc[:10,0:162]))
#get the class label vector y the same way (column 162 holds the class label)
y = pd.concat((nipd[nipfiles[0]].iloc[:10,162],nipd[nipfiles[1]].iloc[:10,162]))
In [6]:
#checking that our arrays will be the right size
print "length of X array is %i"%len(X.values)
print "length of y array is %i"%len(y.values)
There are a few different classification algorithms to test in order to replicate Qi's results.
Starting with good old Naive Bayes, as I understand reasonably well how it works. It's a simple classifier that assumes the features are independent given the class label, which is apparently a reasonable assumption surprisingly often. Naive Bayes is usually presented with binary features, but the features used in this example are clearly not all binary. So, how do we get Scikit-learn to deal with that?
Specifically, the features are encoded as described here.
Looks like the ways around this are both pretty much just hacks, so it's probably better to use something else. If I do have to write my own then this page might come in useful. I don't particularly want to, though.
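For the record, one of the workarounds (a sketch of my own, not something from the paper) would be to just pretend each continuous feature is Gaussian given the class and use Scikit-learn's GaussianNB. Probably a poor fit for these encoded features, but it does at least run:
In [ ]:
from sklearn.naive_bayes import GaussianNB
#naive Bayes with each feature modelled as a Gaussian given the class label -
#almost certainly a crude approximation for these encoded features
nbmodel = GaussianNB()
nbmodel.fit(X.values, y.values)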
Logistic regression is pretty simple as well, so it shouldn't be too difficult to run. From the logistic regression page it appears that it can be used with the defaults. As far as I understand it, logistic regression doesn't make any assumptions about the distributions of the features.
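(For reference, logistic regression just models the probability of the positive class directly, as P(y=1|x) = 1/(1 + exp(-(w·x + b))), i.e. a sigmoid of a linear combination of the features, so it only ever sees a weighted sum of the features rather than assuming anything about how they are distributed.)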
In [7]:
from sklearn import linear_model as sklm
#initialise a logistic regression model
logmodel = sklm.LogisticRegression()
Now the question is what the fit method expects of the data. What format should it be in? From the documentation:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target vector relative to X.
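So the X and y built above should already be in the right shape; quickly checking the underlying numpy array shapes to be sure:
In [ ]:
#(n_samples, n_features) for X and (n_samples,) for y
print "X shape:", X.values.shape
print "y shape:", y.values.shape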
Now trying to fit the model. It appears the following cell would take a very long time to run with the full number of samples (about 240,000), and it's not clear yet what I can do about that.
In [8]:
logmodel.fit(X.values,y.values)
Out[8]:
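As a note for later: if fitting on the full ~240,000 samples does turn out to be too slow, one option (my own suggestion, not something from the paper) might be Scikit-learn's SGDClassifier with logistic loss, which fits the same kind of linear model by stochastic gradient descent and scales better to large sample counts (it would probably also want the features scaled first):
In [ ]:
#possible fallback for the full dataset: a logistic model fitted by stochastic
#gradient descent (not what the paper uses - noting it here in case the
#standard solver is too slow on all the samples)
sgdmodel = sklm.SGDClassifier(loss="log")
sgdmodel.fit(X.values, y.values)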
Now all we need is a test set.
In [9]:
#concatenate the positive and negative training sets together
testX = pd.concat((nipd[nipfiles[0]].iloc[10:20,0:162],nipd[nipfiles[1]].iloc[10:20,0:162]))
#get the class label vector y the same way
testy = pd.concat((nipd[nipfiles[0]].iloc[10:20,162],nipd[nipfiles[1]].iloc[10:20,162]))
In [14]:
estimates = logmodel.predict_proba(testX.values)
Then we need to compare these estimates to the real labels. Is there an automated way to do this? Of course there is: Scikit-learn's ROC curve functions.
In [21]:
from sklearn.metrics import roc_curve,auc
fpr, tpr, thresholds = roc_curve(testy.values,estimates[:,1])
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
In [24]:
#define a function for quickly plotting ROC curves with nice annotations
#(plotting names come from matplotlib's pyplot - imported explicitly here in
#case the notebook wasn't started with pylab enabled)
from matplotlib.pyplot import clf, plot, xlim, ylim, xlabel, ylabel, title, show

def plotroc(fpr,tpr):
    clf()
    plot(fpr, tpr)
    plot([0, 1], [0, 1], 'k--')
    xlim([0.0, 1.0])
    ylim([0.0, 1.0])
    xlabel('False Positive Rate')
    ylabel('True Positive Rate')
    title('Receiver operating characteristic')
    show()
    return None

plotroc(fpr,tpr)
Ok, so that's terrible, probably because we're only using 20 training and 20 test examples. Let's increase that to 1000 examples per class, split half and half into training and test sets:
In [44]:
from sklearn.utils import shuffle as skshuffle
#how many samples per class?
N = 1000
#concatenate the positive and negative training sets together
X = pd.concat((nipd[nipfiles[0]].iloc[:N,0:162],nipd[nipfiles[1]].iloc[:N,0:162]))
#get the class label vector y the same way
y = pd.concat((nipd[nipfiles[0]].iloc[:N,162],nipd[nipfiles[1]].iloc[:N,162]))
#then shuffle all of them
X,y = skshuffle(X.values,y.values)
#find the midpoint
half = len(X)//2
#split it into test and train
Xtrain, Xtest = X[:half], X[half:]
ytrain, ytest = y[:half], y[half:]
Then we want to fit the model again and test it again:
In [45]:
#reinitialise a logistic regression model
logmodel = sklm.LogisticRegression()
#fit it again
logmodel.fit(Xtrain,ytrain)
#test it again
estimates = logmodel.predict_proba(Xtest)
#recalc the roc curve
fpr, tpr, thresholds = roc_curve(ytest,estimates[:,1])
#print area under roc curve
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
#replot the roc curve
plotroc(fpr,tpr)
However, this is strange, because we're not even using the full data and we're getting better results than the paper: we should be getting a score of about 0.15, and what we are getting is 0.90.
Although, the number they're quoting in the paper is something called an R50 value, which is:
...R50 is a partial AUC score that measures the area under the ROC curve until reaching 50 negative predictions.
So we need to calculate this R50 value if we're going to replicate the results.
In this case the test set has 1000 points, of which roughly 500 are negative, so reaching 50 negatives corresponds to a false positive rate of 50/500 = 0.1; the R50 value then corresponds to (I guess?) the area under the curve up to that false positive rate.
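To make that interpretation explicit, here's a small helper that computes an R50 the way I'm reading the definition: the area under the ROC curve up to the false positive rate at which 50 negatives have been misclassified, given the number of negative examples in the test set. This is just my reading, not necessarily what the paper actually computes:
In [ ]:
import numpy as np
def r50_score(fpr, tpr, n_negatives):
    #false positive rate at which 50 negatives have been misclassified
    max_fpr = 50.0/n_negatives
    #keep only the part of the ROC curve up to that false positive rate
    keep = fpr <= max_fpr
    return auc(fpr[keep], tpr[keep])
#e.g. (assuming negatives are labelled 0):
print "R50 estimate: %.2f"%r50_score(fpr, tpr, np.sum(ytest == 0))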
In [58]:
import numpy as np
roccrv = pd.DataFrame(np.array([fpr,tpr]).T)
#don't ask about this line - it just keeps the part of the curve with fpr < 0.1 and unzips it back into separate fpr and tpr sequences
plotroc(*zip(*roccrv[roccrv[0]<0.1].values))
In [60]:
r50 = auc(*zip(*roccrv[roccrv[0]<0.1].values))
print "The R50 value for this ROC curve is %.2f"%r50
Which could be right: it's certainly below what they get in the paper, and it's obtained using much less training data than they used. Unfortunately, I still can't find another source for what an R50 actually is. I doubt what I'm doing is 100% correct, because then the R50 value you get would depend on the size of your test set (if my test set were smaller I could tolerate a higher false positive rate when finding the R50 value, which would obviously improve my results).
In lieu of anything better I'm just going to try training this model with 80% of the data and see if the result matches what they have in the paper. If it does, I'll assume my interpretation of the R50 value is right.
In [61]:
#concatenate the positive and negative training sets together
X = pd.concat((nipd[nipfiles[0]].iloc[:,0:162],nipd[nipfiles[1]].iloc[:,0:162]))
#get the class label vector y the same way
y = pd.concat((nipd[nipfiles[0]].iloc[:,162],nipd[nipfiles[1]].iloc[:,162]))
#then shuffle all of them
X,y = skshuffle(X.values,y.values)
#find the 80% split point
eighty = int(len(X))*0.8
#split into training (80%) and test (20%) sets
Xtrain, Xtest = X[:eighty], X[eighty:]
ytrain, ytest = y[:eighty], y[eighty:]
Ignoring the warnings from the cell above for now (presumably numpy complaining about the non-integer slice index). Ran the code below; it took a while, though less than an hour:
In [62]:
#reinitialise a logistic regression model
logmodel = sklm.LogisticRegression()
#fit it again
logmodel.fit(Xtrain,ytrain)
#test it again
estimates = logmodel.predict_proba(Xtest)
#recalc the roc curve
fpr, tpr, thresholds = roc_curve(ytest,estimates[:,1])
#print area under roc curve
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
#replot the roc curve
plotroc(fpr,tpr)
To recalculate the R50 we'll need to know the size of the test set:
In [63]:
len(ytest)
Out[63]:
So (using the whole test-set size here, rather than just the number of negatives) we can only tolerate a false positive rate of:
In [69]:
r50fpr = 50.0/len(ytest)
print r50fpr
Then we can calculate the R50 for this:
In [70]:
roccrv = pd.DataFrame(np.array([fpr,tpr]).T)
r50 = auc(*zip(*roccrv[roccrv[0]<r50fpr].values))
print "The R50 value for this ROC curve is %.2f"%r50
So that can't be right. I'll ask the Machine Learning professors if they've heard of this value. In the meantime I think I can just recreate the test set used in the paper using this quote:
...in our random test set, there are approximately 50 positive items (because 1 in 600 pairs are interacting and we selected 30,000 pairs)...
What's the problem then? Does the test set have too many positive items in it? The training and test sets were shuffled together in this example, so how many positive examples ended up in the test set?
In [72]:
sum(ytest)
Out[72]:
That's a lot more than 50...
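One way to get closer to the paper's setup would be to rebuild the test set with roughly their class balance: about 1 positive in 600 pairs, i.e. ~50 positives in ~30,000 pairs. Here's a rough sketch of doing that by subsampling the negatives (my own construction, not necessarily the paper's exact protocol; a proper version would also need to make sure none of these pairs were used for training):
In [ ]:
import numpy as np
#subsample the negatives so positives are ~1 in 600, as in the paper's test set
#(assumes nipfiles[0] is the positive set, nipfiles[1] the negative set, and
#that the negative file has enough rows)
pos = nipd[nipfiles[0]]
neg = nipd[nipfiles[1]]
n_pos = 50
n_neg = n_pos*599
neg_idx = np.random.permutation(len(neg))[:n_neg]
paper_test = pd.concat((pos.iloc[:n_pos], neg.iloc[neg_idx]))
paperXtest = paper_test.iloc[:,0:162].values
paperytest = paper_test.iloc[:,162].values
print "test set size: %i, positives: %i"%(len(paperytest), paperytest.sum())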
The only other lead is that the paper says it uses an LR classifier that:
...uses a ridge estimator for building a binary LR model.
This is already a binary LR model, so I guess I should make sure it's using a ridge estimator, i.e. make sure it's regularised.
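As far as I can tell, Scikit-learn's LogisticRegression is already regularised: the penalty parameter defaults to 'l2', which is exactly the ridge penalty, with the strength controlled by C (smaller C means stronger regularisation). So making that explicit would just be:
In [ ]:
#L2 (i.e. ridge) regularised logistic regression - penalty='l2' and C=1.0 are
#already the Scikit-learn defaults, so this should behave the same as before
logmodel = sklm.LogisticRegression(penalty='l2', C=1.0)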
Update: asked the Machine Learning professors about the R50 value and they had never heard of it.
In [ ]: